 medical model


AutoMedEval: Harnessing Language Models for Automatic Medical Capability Evaluation

arXiv.org Artificial Intelligence

With the proliferation of large language models (LLMs) in the medical domain, there is increasing demand for improved evaluation techniques to assess their capabilities. However, traditional metrics like F1 and ROUGE, which rely on token overlaps to measure quality, significantly overlook the importance of medical terminology. While human evaluation tends to be more reliable, it can be very costly and may also suffer from inaccuracies due to limits in human expertise and motivation. Although there are some evaluation methods based on LLMs, their usability in the medical field is limited due to their proprietary nature or lack of expertise. To tackle these challenges, we present AutoMedEval, an open-source automatic evaluation model with 13B parameters specifically engineered to measure the question-answering proficiency of medical LLMs. The overarching objective of AutoMedEval is to assess the quality of responses produced by diverse models, aspiring to significantly reduce the dependence on human evaluation. Specifically, we propose a hierarchical training method involving curriculum instruction tuning and an iterative knowledge introspection mechanism, enabling AutoMedEval to acquire professional medical assessment capabilities with limited instructional data. Human evaluations indicate that AutoMedEval surpasses other baselines in terms of correlation with human judgments.
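The limitation of token-overlap metrics mentioned in this abstract can be illustrated with a minimal sketch. The function `token_f1` and the three example sentences are invented for illustration and are not taken from the AutoMedEval paper; the metric is a plain whitespace-token F1, which is only one simple member of the F1/ROUGE family.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between a prediction and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, invented for this sketch.
reference = "the patient should take metformin for type 2 diabetes"
wrong_drug = "the patient should take warfarin for type 2 diabetes"
paraphrase = "metformin is recommended to manage type 2 diabetes"

score_wrong = token_f1(wrong_drug, reference)
score_paraphrase = token_f1(paraphrase, reference)
print(f"wrong drug: {score_wrong:.3f}, paraphrase: {score_paraphrase:.3f}")
```

In this toy case the answer that swaps the drug name shares eight of nine tokens with the reference and so scores higher than a paraphrase that names the same drug, which is the kind of failure a terminology-aware evaluator is meant to avoid.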


IIMedGPT: Promoting Large Language Model Capabilities of Medical Tasks by Efficient Human Preference Alignment

arXiv.org Artificial Intelligence

Recent research on large language models (LLMs), which are pre-trained on massive general-purpose corpora, has achieved breakthroughs in responding to human queries. However, these methods face challenges, including insufficient data to support extensive pre-training and an inability to align responses with users' instructions. To address these issues, we introduce a medical instruction dataset, CMedINS, containing six medical instructions derived from actual medical tasks, which effectively fine-tunes LLMs in conjunction with other data. Subsequently, we launch our medical model, IIMedGPT, employing an efficient preference alignment method, Direct Preference Optimization (DPO). The results show that our final model outperforms existing medical models in medical dialogue. Datasets, code, and model checkpoints will be released upon acceptance.
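For readers unfamiliar with DPO, the per-example loss can be sketched as below. This follows the standard DPO objective as generally published, not the IIMedGPT training code, which the abstract does not show; the function name `dpo_loss`, the default `beta=0.1`, and the example log-probabilities are assumptions chosen for illustration.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference model
) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for both signs of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy has shifted mass toward the
# preferred response relative to the reference model.
loss_good = dpo_loss(-8.0, -12.0, -10.0, -10.0)
loss_neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
loss_bad = dpo_loss(-12.0, -8.0, -10.0, -10.0)
print(loss_good, loss_neutral, loss_bad)
```

The loss equals log 2 when the policy matches the reference and falls as the policy raises the preferred response relative to the rejected one, so no separate reward model is needed.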


Introducing the Large Medical Model: State of the art healthcare cost and risk prediction with transformers trained on patient event sequences

arXiv.org Machine Learning

With U.S. healthcare spending approaching $5T (NHE Fact Sheet 2024), and 25% of it estimated to be wasteful (Waste in the US health care system: estimated costs and potential for savings, n.d.), the need to better predict risk and optimal patient care is ever more important. This paper introduces the Large Medical Model (LMM), a generative pre-trained transformer (GPT) designed to guide and predict the broad facets of patient care and healthcare administration. The model is trained on medical event sequences from over 140M longitudinal patient claims records with a specialized vocabulary built from medical terminology systems and demonstrates a superior capability to forecast healthcare costs and identify potential risk factors. Through experimentation and validation, we showcase the LMM's proficiency not only in cost and risk predictions, but also in discerning intricate patterns within complex medical conditions and an ability to identify novel relationships in patient care. The LMM is able to improve both cost prediction by 14.1% over the best commercial models and chronic conditions prediction by 1.9% over the best transformer models in research predicting a broad set of conditions. The LMM is a substantial advancement in healthcare analytics, offering the potential to significantly enhance risk assessment, cost management, and personalized medicine.


Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?

arXiv.org Artificial Intelligence

Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.
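The win / tie / loss accounting described in this abstract depends on testing each medical model against its own base model with statistical uncertainty taken into account. The abstract does not specify the test used, so the following is only a sketch under the assumption of a paired percentile bootstrap over per-question correctness; `compare_models`, its parameters, and the toy inputs are hypothetical.

```python
import random


def compare_models(medical_correct, base_correct, n_boot=2000, alpha=0.05, seed=0):
    """Label a medical-vs-base comparison as 'win', 'tie', or 'loss'.

    Inputs are equal-length lists of 0/1 correctness on the same questions.
    A paired percentile bootstrap gives a confidence interval for the mean
    accuracy difference; the result is a 'tie' if the interval contains 0.
    """
    if len(medical_correct) != len(base_correct):
        raise ValueError("both models must be scored on the same questions")
    rng = random.Random(seed)
    n = len(medical_correct)
    diffs = [m - b for m, b in zip(medical_correct, base_correct)]

    boot_means = []
    for _ in range(n_boot):
        # Resample questions with replacement, keeping pairs together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()

    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    if lower > 0:
        return "win"
    if upper < 0:
        return "loss"
    return "tie"


# Toy inputs, invented for this sketch.
always_right = [1] * 100
always_wrong = [0] * 100
print(compare_models(always_right, always_wrong))  # medical model better on every item
print(compare_models(always_wrong, always_right))  # base model better on every item
print(compare_models(always_right, always_right))  # identical predictions
```

Counting the resulting labels across all task and model pairs would give percentages of the kind reported in the abstract; comparing raw accuracies alone would turn every small difference into a win or a loss.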


The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models

arXiv.org Artificial Intelligence

Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare ten public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting and supervised fine-tuning regimes for medical question-answering (QA). For instance, across all tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 22.7% of cases, reach a (statistical) tie in 36.8% of cases, and are significantly worse than their base models in the remaining 40.5% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately in zero-/few-shot prompting; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Meanwhile, we find that after fine-tuning on specific QA tasks, medical LLMs can show performance improvements, but the benefits do not carry over to tasks based on clinical notes. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.


A Large Medical Model based on Visual Physiological Monitoring for Public Health

arXiv.org Artificial Intelligence

The widespread outbreak of the COVID-19 pandemic has sounded a warning about the globalization challenges in public health. In this context, the establishment of large-scale public health datasets, of medical models, and of decision-making systems with a human-centric approach holds strategic significance. Recently, groundbreaking advancements have emerged in AI methods for physiological signal monitoring and disease diagnosis based on camera sensors. These approaches, requiring no specialized medical equipment, offer convenient ways of collecting large-scale medical data in response to public health events. Not only do these methods facilitate the acquisition of unbiased datasets, but they also enable the development of fair large medical models (LMMs). Therefore, we outline a prospective framework and heuristic vision for a public health large medical model (PHLMM) utilizing visual-based physiological monitoring (VBPM) technology. The PHLMM can be considered as a "convenient and universal" framework for public health, advancing the United Nations' "Sustainable Development Goals 2030", particularly in its promotion of Universal Health Coverage (UHC) in low- and middle-income countries. Furthermore, this paper provides an outlook on the crucial application prospects of PHLMM in response to public health challenges and its significant role in the field of AI for medicine (AI4medicine). In summary, PHLMM serves as a solution for constructing a large-scale medical database and LMM, eliminating the issue of dataset bias and unfairness in AI models. The outcomes will contribute to the establishment of an LMM framework for public health, acting as a crucial bridge for advancing AI4medicine.


SAMM (Segment Any Medical Model): A 3D Slicer Integration to SAM

arXiv.org Artificial Intelligence

The advent of large language models (LLM) has led to significant progress in image analysis with potential for future advancements. SAM [Kirillov et al., 2023] is a revolutionary foundation model for image segmentation and has already shown the capability of handling diverse segmentation tasks. SAM especially prevails in zero-shot domain generalization cases compared with the existing elaborate, fine-tuned models trained on specific domains. An important prospect for the application of SAM would be its adaptation to the complex task of segmenting medical images with significant inter-subject variations and a low signal-to-noise ratio. The segmentation task allows separation of different structures in medical images, which are then used to detect the region of interest or reconstruct multi-dimensional anatomical models [Sinha and Dolz, 2021]. The existing AI-based segmentation methods, however, do not fully bridge the domain gap among different imaging modalities, such as computed tomography (CT), magnetic resonance imaging (MRI), or ultrasound (US) [Wang et al., 2020]. The domain gap refers to the difference in the data format across various image modalities, as each modality offers a distinct advantage in visualizing anatomical structures and related pathologies (e.g., tumor, bone fracture). This difference introduces specific challenges for training AI systems to perform common analysis without the need for a comprehensive dataset that includes all relevant domains from various image modalities.


Definition drives design: Disability models and mechanisms of bias in AI technologies

arXiv.org Artificial Intelligence

The increasing deployment of artificial intelligence (AI) tools to inform decision making across diverse areas including healthcare, employment, social benefits, and government policy, presents a serious risk for disabled people, who have been shown to face bias in AI implementations. While there has been significant work on analysing and mitigating algorithmic bias, the broader mechanisms of how bias emerges in AI applications are not well understood, hampering efforts to address bias where it begins. In this article, we illustrate how bias in AI-assisted decision making can arise from a range of specific design decisions, each of which may seem self-contained and non-biasing when considered separately. These design decisions include basic problem formulation, the data chosen for analysis, the use the AI technology is put to, and operational design elements in addition to the core algorithmic design. We draw on three historical models of disability common to different decision-making settings to demonstrate how differences in the definition of disability can lead to highly distinct decisions on each of these aspects of design, leading in turn to AI technologies with a variety of biases and downstream effects. We further show that the potential harms arising from inappropriate definitions of disability in fundamental design stages are further amplified by a lack of transparency and disabled participation throughout the AI design process. Our analysis provides a framework for critically examining AI technologies in decision-making contexts and guiding the development of a design praxis for disability-related AI analytics. We put forth this article to provide key questions to facilitate disability-led design and participatory development to produce more fair and equitable AI technologies in disability-related contexts.